Smart Paper Notes

VALUE: Understanding Dialect Disparity in NLU

Caleb Ziems , Jiaao Chen , Camille Harris , Jessica Anderson , Diyi Yang

Category: Natural Language Processing

2022-04-06

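English natural language understanding (NLU) systems have achieved strong performance and even excel on benchmarks such as GLUE and SuperGLUE. However, these benchmarks contain only textbook Standard American English (SAE). Other dialects have been largely overlooked in the NLP community. This leads to biased and inequitable NLU systems that serve only a sub-population of speakers. To understand the disparities in current models and to facilitate more dialect-competent NLU systems, we introduce the VernAcular Language Understanding Evaluation (VALUE) benchmark, a challenging variant of GLUE that we created with a set of lexical and morphosyntactic transformation rules. In this initial release (V.1), we construct rules for 11 features of African American Vernacular English (AAVE), and we recruit fluent AAVE speakers to validate each feature transformation via linguistic acceptability judgments in a participatory design manner. Experiments show that these new dialectal features can lead to a drop in model performance. To run the transformation code and download both the synthetic and gold-standard dialectal GLUE benchmarks, see https://github.com/salt-nlp/value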

Translated by Google Translate

Forecasting West Nile Virus with Graph Neural Networks: Harnessing Spatial Dependence in Irregularly Sampled Geospatial Data

Adam Tonks , Trevor Harris , Bo Li , William Brown , Rebecca Smith

Category: Machine Learning

2022-12-21

Machine learning methods have seen increased application to geospatial environmental problems, such as precipitation nowcasting, haze forecasting, and crop yield prediction. However, many of the machine learning methods applied to mosquito population and disease forecasting do not inherently take into account the underlying spatial structure of the given data. In our work, we apply a spatially aware graph neural network model consisting of GraphSAGE layers to forecast the presence of West Nile virus in Illinois, to aid mosquito surveillance and abatement efforts within the state. More generally, we show that graph neural networks applied to irregularly sampled geospatial data can exceed the performance of a range of baseline methods including logistic regression, XGBoost, and fully-connected neural networks.
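
The spatial aggregation that a GraphSAGE layer performs can be illustrated with a small pure-Python sketch. It assumes a mean aggregator and a precomputed neighbour list between sampling sites; the function names, the weight matrices, and the way the graph is built are illustrative assumptions, not details taken from the paper.

```python
def relu(vec):
    """Element-wise ReLU on a list of floats."""
    return [max(0.0, x) for x in vec]


def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def sage_mean_layer(features, neighbours, w_self, w_neigh):
    """One GraphSAGE-style layer with a mean aggregator (hypothetical helper).

    features:   dict node -> feature vector
    neighbours: dict node -> list of neighbouring nodes (assumed given)
    w_self, w_neigh: weight matrices applied to the node's own features
                     and to the mean of its neighbours' features
    """
    updated = {}
    for node, h in features.items():
        nbrs = neighbours.get(node, [])
        if nbrs:
            # Average each feature dimension over the neighbouring sites.
            mean = [sum(features[n][i] for n in nbrs) / len(nbrs)
                    for i in range(len(h))]
        else:
            # Isolated node: no neighbourhood signal.
            mean = [0.0] * len(h)
        combined = [a + b for a, b in
                    zip(matvec(w_self, h), matvec(w_neigh, mean))]
        updated[node] = relu(combined)
    return updated


# Toy graph of three sampling sites; site 0 is adjacent to sites 1 and 2.
features = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [1.0, 1.0]}
neighbours = {0: [1, 2], 1: [0], 2: [0]}
identity = [[1.0, 0.0], [0.0, 1.0]]
print(sage_mean_layer(features, neighbours, identity, identity))
```

In a trained model the two weight matrices would be learned, and several such layers would be stacked before a classification head; the identity matrices here only make the aggregation easy to follow by hand.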

FAIR AI Models in High Energy Physics

Javier Duarte , Haoyang Li , Avik Roy , Ruike Zhu , E. A. Huerta , Daniel Diaz , Philip Harris , Raghav Kansal , Daniel S. Katz , Ishaan H. Kavoori

Category: Machine Learning

2022-12-09

The findable, accessible, interoperable, and reusable (FAIR) data principles have provided a framework for examining, evaluating, and improving how we share data with the aim of facilitating scientific discovery. Efforts have been made to generalize these principles to research software and other digital products. Artificial intelligence (AI) models -- algorithms that have been trained on data rather than explicitly programmed -- are an important target for this because of the ever-increasing pace with which AI is transforming scientific and engineering domains. In this paper, we propose a practical definition of FAIR principles for AI models and create a FAIR AI project template that promotes adherence to these principles. We demonstrate how to implement these principles using a concrete example from experimental high energy physics: a graph neural network for identifying Higgs bosons decaying to bottom quarks. We study the robustness of these FAIR AI models and their portability across hardware architectures and software frameworks, and report new insights on the interpretability of AI predictions by studying the interplay between FAIR datasets and AI models. Enabled by publishing FAIR AI models, these studies pave the way toward reliable and automated AI-driven scientific discovery.

Neural Cell Video Synthesis via Optical-Flow Diffusion

Manuel Serna-Aguilera , Khoa Luu , Nathaniel Harris , Min Zou

Category: Computer Vision

2022-12-06

The biomedical imaging world is notorious for working with small amounts of data, frustrating state-of-the-art efforts in the computer vision and deep learning worlds. With large datasets, it is easier to make progress, as we have seen with natural images. Microscopy videos of neuron cells moving in a culture are no exception. Collecting them presents several challenges: it can be difficult to grow and maintain the culture for days, and it is expensive to acquire the materials and equipment. In this work, we explore how to alleviate this data scarcity problem by synthesizing the videos. We therefore use a recent video diffusion model to synthesize videos of cells from our training dataset. We then analyze the model's strengths and consistent shortcomings to guide us in making the generated videos as high-quality as possible. To improve on this task, we propose modifying the denoising function and adding motion information (dense optical flow) so that the model has more context on how video frames transition and how each pixel changes over time.

A Cross-Conformal Predictor for Multi-label Classification

Harris Papadopoulos

Category: Machine Learning

2022-11-29

Unlike the typical classification setting, where each instance is associated with a single class, in multi-label learning each instance is associated with multiple classes simultaneously. The learning task in this setting is therefore to predict the subset of classes to which each instance belongs. This work examines the application of a recently developed framework called Conformal Prediction (CP) to the multi-label learning setting. CP complements the predictions of machine learning algorithms with reliable measures of confidence. As a result, the proposed approach, instead of just predicting the most likely subset of classes for a new unseen instance, also indicates the likelihood of each predicted subset being correct. This additional information is especially valuable in the multi-label setting, where the overall uncertainty is extremely high.
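
How a conformal predictor attaches confidence to candidate label subsets can be sketched as follows. This is the generic split (inductive) conformal p-value, swapped in for illustration rather than the cross-conformal procedure the paper proposes; the nonconformity scores, label names, and function names are all hypothetical.

```python
def conformal_p_value(calibration_scores, test_score):
    """Fraction of scores (calibration plus the test one) at least as
    nonconforming as the test score."""
    n_at_least = sum(1 for s in calibration_scores if s >= test_score)
    return (n_at_least + 1) / (len(calibration_scores) + 1)


def prediction_region(calibration_scores, candidate_scores, significance=0.1):
    """Keep every candidate label subset whose p-value exceeds the
    significance level.

    candidate_scores: dict mapping a frozenset of labels to the
    nonconformity score obtained when that subset is assumed correct.
    """
    return {
        labels
        for labels, score in candidate_scores.items()
        if conformal_p_value(calibration_scores, score) > significance
    }


# Hypothetical nonconformity scores from nine held-out calibration examples.
calibration = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
candidates = {
    frozenset({"a"}): 0.05,       # conforms well, p-value 1.0
    frozenset({"a", "b"}): 0.85,  # borderline, p-value 0.2
    frozenset({"b"}): 0.95,       # stranger than all calibration examples
}
print(prediction_region(calibration, candidates, significance=0.1))
```

Lowering the significance level enlarges the region, since more subsets are retained, while the guaranteed error rate shrinks; that trade-off is what makes the output informative when uncertainty is high.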

Industry-Scale Orchestrated Federated Learning for Drug Discovery

Martijn Oldenhof , Gergely Ács , Balázs Pejó , Ansgar Schuffenhauer , Nicholas Holway , Noé Sturm , Arne Dieckmann , Oliver Fortmeier , Eric Boniface , Clément Mayer

Category: Machine Learning | Machine Learning (Statistics)

2022-10-17

To apply federated learning to drug discovery we developed a novel platform in the context of European Innovative Medicines Initiative (IMI) project MELLODDY (grant n°831472), which was comprised of 10 pharmaceutical companies, academic research labs, large industrial companies and startups. The MELLODDY platform was the first industry-scale platform to enable the creation of a global federated model for drug discovery without sharing the confidential data sets of the individual partners. The federated model was trained on the platform by aggregating the gradients of all contributing partners in a cryptographic, secure way following each training iteration. The platform was deployed on an Amazon Web Services (AWS) multi-account architecture running Kubernetes clusters in private subnets. Organisationally, the roles of the different partners were codified as different rights and permissions on the platform and administrated in a decentralized way. The MELLODDY platform generated new scientific discoveries which are described in a companion paper.
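
The per-iteration gradient aggregation described in this abstract can be sketched in a few lines. The sketch assumes plain, equally weighted averaging and deliberately leaves out the cryptographic protection, whose details the abstract does not give; the function names and numbers are hypothetical.

```python
def average_gradients(partner_gradients):
    """Element-wise mean of the gradient vectors sent by each partner.

    NOTE: the real platform aggregates in a cryptographically secure way;
    this unprotected average only illustrates the arithmetic.
    """
    n = len(partner_gradients)
    return [sum(values) / n for values in zip(*partner_gradients)]


def apply_update(weights, gradient, learning_rate):
    """One gradient-descent step on the shared model weights."""
    return [w - learning_rate * g for w, g in zip(weights, gradient)]


# Three hypothetical partners each compute a gradient on private data.
gradients = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
shared_weights = [0.5, 0.5]

aggregated = average_gradients(gradients)
shared_weights = apply_update(shared_weights, aggregated, learning_rate=0.5)
print(aggregated, shared_weights)
```

Only the gradients leave each partner in this scheme, never the underlying data sets, which is the property the abstract emphasises.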

B2B Advertising: Joint Dynamic Scoring of Account and Users

Atanu R. Sinha , Gautam Choudhary , Mansi Agarwal , Shivansh Bindal , Abhishek Pande , Camille Girabawe

Category: Machine Learning

2022-09-28

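When one business sells to another (B2B), the buying business is represented by a group of individuals, termed an account, who collectively decide whether to buy. The seller advertises to each individual and interacts with them, mostly by digital means. The sales cycle is long, generally spanning several months. Individuals belonging to an account differ in how they seek information, so the seller needs to score each individual's interest over a long horizon to decide which individuals must be reached and when. Moreover, the buying decision rests with the account, which must be scored to project the likelihood of purchase; that decision can change right up until the actual decision is made, which is emblematic of group decision making. We score the decisions of the account and of its individuals in a dynamic manner. Dynamic scoring provides opportunities to influence different individual members at different points in time over the long horizon. The dataset contains behavior logs of each individual's communication activities with the seller; however, there are no data on the consultations among individuals that lead to the decision. Using a neural network architecture, we propose several ways to aggregate information from individual members' activities to predict the group's collective decision. Multiple evaluations find strong model performance.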

Adversarial Stain Transfer to Study the Effect of Color Variation on Cell Instance Segmentation

Huaqian Wu , Nicolas Souedet , Camille Mabillon , Caroline Jan , Cédric Clouchoux , Thierry Delzescaux

Category: Computer Vision

2022-09-01

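Stain color variation in histological images, caused by a variety of factors, is a challenge not only for the visual diagnosis of pathologists but also for cell segmentation algorithms. To eliminate color variation, many stain normalization methods have been proposed. However, most were designed for hematoxylin and eosin stained images and perform poorly on immunohistochemically stained images. Current cell segmentation methods systematically apply stain normalization as a preprocessing step, but the impact of color variation has not yet been studied quantitatively. In this paper, we produced five groups of NeuN-stained images with different colors. We applied a deep learning image-recoloring method to perform color transfer between groups of histological images. Finally, we altered the color of a segmentation set and quantified the impact of color variation on cell segmentation. The results demonstrate the necessity of color normalization prior to subsequent analysis.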

Individual Tree Detection in Large-Scale Urban Environments using High-Resolution Multispectral Imagery

Jonathan Ventura , Milo Honsberger , Cameron Gonsalves , Julian Rice , Camille Pawlak , Natalie L. R. Love , Skyler Han , Viet Nguyen , Keilana Sugano , Jacqueline Doremus

Category: Computer Vision

2022-08-22

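We introduce a novel deep learning method for detecting individual trees in urban environments using high-resolution multispectral aerial imagery. We use a convolutional neural network to regress a confidence map indicating the locations of individual trees, which are then localized using a peak-finding algorithm. Our method provides complete spatial coverage by detecting trees in both public and private spaces, and it can scale to very large areas. In our study area, spanning five cities in Southern California, we achieved an F-score of 0.735 and an RMSE of 2.157 m. We used our method to produce a map of all trees in the urban forest of California, indicating its potential to support future urban forestry studies at unprecedented scales.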
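
The step from a regressed confidence map to individual tree locations can be illustrated with a minimal peak-finding sketch. It assumes a strict local-maximum test over the 8-neighbourhood with a fixed threshold; the function name, the threshold, and the toy map are hypothetical, and the paper's actual peak-finding algorithm may differ.

```python
def find_peaks(conf_map, threshold=0.5):
    """Return (row, col) positions that reach the threshold and are
    strictly greater than all of their 8-connected neighbours."""
    rows, cols = len(conf_map), len(conf_map[0])
    peaks = []
    for r in range(rows):
        for c in range(cols):
            value = conf_map[r][c]
            if value < threshold:
                continue
            # Collect the neighbours, clipping at the image border.
            neighbours = [
                conf_map[rr][cc]
                for rr in range(max(0, r - 1), min(rows, r + 2))
                for cc in range(max(0, c - 1), min(cols, c + 2))
                if (rr, cc) != (r, c)
            ]
            if all(value > n for n in neighbours):
                peaks.append((r, c))
    return peaks


# Toy 4x4 confidence map standing in for the network output.
confidence = [
    [0.1, 0.2, 0.1, 0.0],
    [0.2, 0.9, 0.3, 0.1],
    [0.1, 0.3, 0.2, 0.7],
    [0.0, 0.1, 0.6, 0.4],
]
print(find_peaks(confidence))  # [(1, 1), (2, 3)]
```

The cell at (3, 2) exceeds the threshold but is not reported, because its neighbour at (2, 3) is higher; this suppression keeps one detection per tree crown rather than several adjacent ones.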

Neural Embedding: Learning the Embedding of Manifold of Physics Data

Sang Eon Park , Philip Harris , Bryan Ostdiek

Category: Machine Learning

2022-08-10

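In this paper, we present a method for embedding physics data manifolds with metric structure into spaces with simpler metrics, such as Euclidean and hyperbolic spaces. We then demonstrate that this can be a powerful step in the data analysis pipeline for many applications. Using progressively more realistic simulated collisions at the Large Hadron Collider, we show that this embedding approach learns the underlying latent structure. Using the notion of volume in Euclidean space, we provide for the first time a viable solution for quantifying the true search capability of model-agnostic search algorithms in collider physics (i.e., anomaly detection). Finally, we discuss how the ideas presented in this paper can be employed to address many practical challenges that require extracting physically meaningful representations from complex high-dimensional datasets.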
